Multithreaded build #495
src/annoylib.h (outdated):

```cpp
for (int i = 0; i < n_threads; i++) {
  int trees_per_thread = -1;
  if (q > -1) {
    // First thread picks up the remainder of the work
```
This might lead to an uneven split, e.g. if q=20 and n_threads=7, then the distribution will be [8, 2, 2, 2, 2, 2, 2]. I think you can do `trees_per_thread = work_per_thread + (i < work_remainder)` and it should split it like [3, 3, 3, 3, 3, 3, 2].
Actually, even simpler: you can just do `trees_per_thread = (q + i) / n_threads` and you don't even have to define `work_per_thread`.
Awesome, I was not exactly happy with my solution, so I'll reuse this.
src/annoylib.h (outdated):

```diff
@@ -1172,7 +1206,87 @@ template<typename S, typename T, typename Distance, typename Random>
   return get_node_ptr<S, Node>(_nodes, _s, i);
 }

-  S _make_tree(const vector<S >& indices, bool is_root) {
+  void _thread_build(int q, int thread_idx, ThreadBarrier& barrier, std::mutex& _nodes_mutex) {
```
I think it would be good to have the same code for threaded and non-threaded building.
I'm not exactly sure what you mean by this: should we still support building without threads (e.g. when n_threads=1)?
I'd actually be OK with only supporting the threaded build, so that we can remove the non-threaded build code.
I'm probably missing something, but `_thread_build` is just a private helper function that was extracted from the previous `build` function. It's there just so I don't have to stick everything inside a lambda when spawning a new thread. Could you point out which part of the code could potentially be removed? 😄
I think `_thread_build` could be a pretty simple wrapper around `_make_tree`. I think the only two places that touch any state are:
- https://github.com/spotify/annoy/blob/master/src/annoylib.h#L1185
- https://github.com/spotify/annoy/blob/master/src/annoylib.h#L1252

I don't think you need the magic where you go into nodes and add offsets etc. So all `_thread_build` would have to do is keep adding trees until you reach the limit.
Yeah exactly, that's what I was thinking. Your approach at https://github.com/spotify/annoy/pull/495/files#diff-91eaadc9d0aa248375a22a64f315e3d8R1250 is very clever, but I'm a bit nervous it's too hacky and I'd love to have something slightly simpler :)

Doesn't your barrier thing roughly accomplish what a shared_mutex does? Or it should at least be possible to modify it into a sort of shared mutex? Haven't thought about it enough.
I agree, I would definitely feel more comfortable if I could simplify it a bit 😄

Unfortunately, a barrier does not accomplish what a shared_mutex is supposed to do: it can only be used as a global synchronization point, not for reader/writer-style locking. After reading a couple of SO threads, a custom shared mutex could prove really tricky to implement correctly. We could use `std::shared_timed_mutex`, which has a few additional capabilities and is available from C++14. Either way, I'll implement the less-magic version we discussed, and then we can compare the performance and decide whether it's worth bumping the C++ version.
Well, I managed to simplify my solution in the latest commit 😄 I removed all of the magic and introduced a couple more `mutex` variables and a `shared_timed_mutex` variable.

I've re-run my time measurements on glove-angular-25:

| Num. threads | n_trees=32 time (s) | n_trees=64 time (s) |
|---|---|---|
| 1 | 49.29 | 99.89 |
| 2 | 26.54 | 53.29 |
| 4 | 15.03 | 29.34 |
| 8 | 10.42 | 18.97 |
| 16 | 9.76 | 17.07 |
Wow, amazing! Any idea whether there's any slowdown compared to the current master?

The new code looks much cleaner and less "clever". I don't know how well supported C++14 is, but hopefully it's not a big problem.

Do you really need so many different mutexes? Not a concern, I was just surprised to see four different ones.
I can prepare a more comprehensive timing benchmark and include a comparison with the current master. But as far as I can tell, there isn't a significant slowdown.

Regarding the number of mutexes: if I guard everything with the shared `nodes_mutex`, just one thread has to acquire an exclusive lock and everything grinds to a halt. So I opted for a more granular approach, where I exclusively lock `nodes_mutex` only when absolutely necessary, and add separate regular locks for all other shared variables (`nodes_size`, `_n_nodes`). I'll see if I can cut down on the number of mutexes, but I currently can't see a better way.
src/annoylib.h (outdated):

```cpp
std::mutex _nodes_mutex;
ThreadBarrier barrier(n_threads);
vector<std::thread> threads(n_threads);
int work_per_thread = (int)floor(q / (double)n_threads);
```
What happens if q = -1? You probably have to determine q before this code in case it's set to -1.

This is super cool! I left a few minor comments, but generally this is great.
src/annoylib.h (outdated):

```diff
@@ -929,7 +952,7 @@ template<typename S, typename T, typename Distance, typename Random>
   return true;
 }

-  bool build(int q, char** error=NULL) {
+  bool build(int q, int n_threads=1, char** error=NULL) {
```
Might be good to default it to -1 and, if that's the case, set it to the number of CPUs.
```cpp
vector<vector<Node*> > thread_trees;
vector<S> thread_roots;
while (1) {
  if (q == -1) {
```
This code to determine q if q = -1 could probably just be left in the `build` function.
The thing with q = -1 is that I don't know beforehand how many trees each thread should build, because it depends on the number of nodes. That is why I have to check for -1 everywhere. A better solution would be to give q = -1 a different meaning, where each thread would build just one tree, for example.
Ah yeah, good point, I forgot about that. OK, if your code works then let's keep it, I guess.

@erikbern Could you take a look at the failing Go and Lua builds, if you're able to help? I'm not sure how to get the -std=c++11 flag into the compilation. Windows Python 2.7 might be tricky, because it seems it doesn't support C++11 thread/mutex at all.
src/annoylib.h (outdated):

```cpp
}

// Wait for all threads to finish before we can start inserting tree nodes into global _nodes array
barrier.wait();
```
Isn't locking the mutex enough to enforce this?
Unfortunately not, due to the reasons noted in #495 (comment): we have to wait for all threads to finish their work before we can mutate the global `_nodes` buffer.
src/annoymodule.cc (outdated):

```diff
 if (!self->ptr)
   return NULL;
-static char const * kwlist[] = {"n_trees", NULL};
-if (!PyArg_ParseTupleAndKeywords(args, kwargs, "i", (char**)kwlist, &q))
+static char const * kwlist[] = {"n_trees", "n_threads", NULL};
```
Just an aside: what do you think about naming this parameter `n_jobs` instead of `n_threads` in the Python wrapper? `n_jobs` seems more consistent with the Python ecosystem. Either way, I think this is a great improvement!
Sure, that makes sense. I can expose an `n_jobs` keyword argument for Python, but keep `n_threads` everywhere else.
…h uses all CPU cores. Rename n_threads to n_jobs in the Python interface.
This looks quite amazing so far. Would be good to figure out the CI failures.

As far as I can tell, Windows Py2.7 does not support the C++ thread/mutex libraries in any way. For Go and Lua I need to add the -std=c++14 flag to the compilation step, but I'm not really sure how to do that.

Maybe there's some dumb way so that if those headers aren't present, it just defaults to single-threaded? Not sure if that's a lot of work. I'd imagine there would be a lot of

FYI, the tests timed out on OSX, which implies there might be some race condition or similar? https://travis-ci.org/github/spotify/annoy/jobs/713675853
…aded policies. Added ANNOYLIB_MULTITHREADED_BUILD which controls which threading policy to use.
Nice, the last change is 👌
Feel free to take another look and see if anything can be improved. OSX is going wonky; I'm still investigating the issue.

I think I figured out the OSX issue. Apparently locking/unlocking is much more expensive under OSX, and I'm doing it excessively in a loop when adding indices. Moving the lock around the loop should do the trick 🤞
…because OSX should be faster now.
This looks great – I'm happy to merge whenever you give 👍

Awesome 👍 I'll just prepare a short timing comparison with master sometime tomorrow. If there aren't any more surprises, it'll be ready to go.
Here is a multithreaded build timing comparison for the glove-angular-25 dataset:

Here are single-threaded timings from master for comparison:

The speed-up is not linear, but it's good enough, especially for a large number of trees. I think it's ready to 🚢
Awesome, I'll merge this now. I won't publish this to PyPI for a few weeks, just in case there are some issues. There are some people using master, so that's effectively a form of "beta testing". Really appreciate the work!

Awesome, sounds good! 😄
Hey there, we are already using the multithreaded build with Python and love the speed improvements! Now we are even thinking about migrating to Go in order to have better performance when loading the data before inserting it into the tree.

Is there a specific issue that blocked this from getting implemented? Do you plan to follow up with a PR that enables multithreaded builds, or should I try sending a PR?
Hey @n1ru4l, no, I wasn't planning on it. You can give it a go if you would like :) As I've mentioned, you would only need to supply the -std=c++14 flag when compiling for Go, and possibly update the Go version in the tests (https://github.com/spotify/annoy/blob/master/tox.ini#L13).

@novoselrok That sounds easy in theory :D I will give it a try!

So we decided against using Go. I won't be implementing this any time soon then.
I'm running this in Docker on Ubuntu, but I'm only seeing one CPU core used at a time (100%); the rest are idle (0%). When I'm training a TensorFlow model in the same container, it uses 100% of the CPUs, so I don't think the Docker configuration is the issue.

I am facing the same issue as @utkarshgupta137, did you find any solution for this?

It has been a while, but I don't think I did.

I see. That's unfortunate. I have created an issue here: #627. Hopefully the author will respond.
Issue: #70

This is my work-in-progress branch for supporting multithreaded builds in Annoy. Locally (Ubuntu 18.04), all C++, Python, and Lua tests seem to pass (I've re-run them multiple times). The only thing remaining is fixing the Go build, and I would appreciate some help with this 😄

I've run some speed tests on the glove-angular-25 index with n_trees=32. Reference time on master (without threads) was 49.9s. Here are the results for multiple threads:

If you have any questions or concerns, feel free to comment.